only use nightly pytorch in ci #243

Merged: 5 commits merged into pytorch:main on Aug 1, 2025

Conversation

@tushar00jain (Contributor) commented on Jul 26, 2025

Summary:

  • change CI to only use nightly PyTorch since block_current_stream is not in stable yet (see the CI sketch below)
  • fix new errors in the nightly version of Pyre
    • remove fixme[29] about a future not being a function
    • make reduce_scatter_quantized return a Work object

Stack created with Sapling. Best reviewed with ReviewStack.
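
For reference, pinning a nightly PyTorch build in CI usually comes down to a pip step like the one below. This is a hedged sketch; the index URL variant (CPU vs. CUDA) and the surrounding workflow step are assumptions rather than this repo's actual configuration.

```sh
# Hypothetical CI step (not necessarily this repo's workflow): install the
# pre-release (nightly) PyTorch wheel from the public nightly index.
# CUDA builds use a different suffix (e.g. .../whl/nightly/cu121).
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
```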

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 26, 2025
@tushar00jain force-pushed the pr243 branch 5 times, most recently from 538ea8a to dfe09bf on Jul 26, 2025 01:12
Summary:
use HTTP transport instead of PG transport -- the PG transport fails to resolve the address when running locally
```diff
@@ -382,7 +382,7 @@ def allreduce(self, tensor: torch.Tensor, should_quantize: bool = False) -> Work
             )
         else:
             work = self._pg.allreduce([tensor], ReduceOp.SUM)
-            work.wait()
+            work.block_current_stream()
```
Member

This partially solves it, but it doesn't really help the case below with the tensor division.

Ideally we wrap the future below in a Work object and then call .block_current_stream() on that.

@tushar00jain (Contributor, author) replied:

> This partially solves it, but it doesn't really help the case below with the tensor division.

In which case?

  • for NCCL with CUDA, the behavior should be the same as the existing one
  • for Gloo with CUDA, the tensor is on the GPU (after the host-to-device copy, but the tensor arg is also on the GPU), so the callback will also return immediately? IIUC, the fut.wait() in the callback that I added will also return immediately
  • for Gloo without CUDA, based on what you said, the callback will be called after the device-to-host copy has completed?
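
For context, the callback pattern under discussion looks roughly like the following. This is a hedged sketch rather than the PR's exact code; pg, tensor, and world_size are placeholder names, and pg is assumed to follow the same allreduce([tensor], ReduceOp.SUM) call shape as the diff above.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch: chain a continuation on the collective's future.
work = pg.allreduce([tensor], dist.ReduceOp.SUM)
fut = work.get_future()

def _divide(fut: torch.futures.Future) -> torch.Tensor:
    # Waiting on the future inside the callback ensures the allreduce result
    # is actually ready before the continuation (the division) touches it.
    fut.wait()
    tensor.div_(world_size)
    return tensor

out_fut = fut.then(_divide)
```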

@tushar00jain (Contributor, author) added:

> Ideally we wrap the future below in a Work object and then call .block_current_stream() on that.

We also need to call work.wait() or work.block_current_stream() to make sure the work finishes on the current stream first, before the future runs on the current stream.
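
To illustrate the idea being discussed, here is a rough sketch of a Work-like wrapper around a future. The _FutureWork class is hypothetical (not this repo's code and not a PyTorch API), and it assumes a CUDA event recorded on the stream that produced the result.

```python
import torch
from torch.futures import Future

class _FutureWork:
    """Hypothetical wrapper: lets a Future be consumed like a Work object, so
    a caller can either block the host or just order the current CUDA stream
    behind the result."""

    def __init__(self, fut: Future, event: torch.cuda.Event) -> None:
        self._fut = fut
        # Assumed to be recorded on the stream where the result was produced.
        self._event = event

    def wait(self) -> bool:
        self._fut.wait()  # blocks the host until the result is ready
        return True

    def block_current_stream(self) -> None:
        # Order the current stream behind the recorded event without blocking
        # the host, mirroring Work.block_current_stream() semantics.
        torch.cuda.current_stream().wait_event(self._event)
```

Either way, as noted above, the underlying collective still has to be ordered on the current stream (via work.wait() or work.block_current_stream()) before the future's continuation runs there.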

@tushar00jain force-pushed the pr243 branch 8 times, most recently from 5507e47 to 61e177c on Jul 29, 2025 04:18
@tushar00jain force-pushed the pr243 branch 7 times, most recently from 4698c70 to d6b54f7 on Jul 29, 2025 18:09
@tushar00jain changed the title from "option 1 - use block_current to overlap compute/communication" to "only use nightly pytorch in ci" on Jul 29, 2025
@tushar00jain force-pushed the pr243 branch 2 times, most recently from cb39d98 to 685e3c4 on Jul 29, 2025 22:32
@d4l3k (Member) left a review:

LGTM

We should document in the README that we require torch nightly.

@tushar00jain force-pushed the pr243 branch 11 times, most recently from d650c7a to 09208a0 on Jul 31, 2025 02:47
Summary:
- call future.wait() in callbacks to make sure the continuation executes after the future has completed
- set the stream correctly to execute the callback scheduled by the bucketized allreduce

Summary:
returns the Work object so we can be more flexible with the usage

Summary:
- change CI to only use nightly PyTorch since block_current_stream is not in stable yet
- fix new errors in the nightly version of Pyre
  - remove fixme[29] about a future not being a function
  - make reduce_scatter_quantized return a Work object
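
As a usage sketch of why returning the Work object adds flexibility, a call site could look roughly like this; the argument names are placeholders, not reduce_scatter_quantized's actual signature.

```python
# Hypothetical call site; argument names are placeholders.
work = reduce_scatter_quantized(output_tensor, input_tensors, process_group)

# ... overlap unrelated compute here ...

# The caller picks the synchronization: block the host, or (on nightly
# PyTorch) order the current CUDA stream behind the collective instead.
work.wait()
# work.block_current_stream()
```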
@tushar00jain merged commit b746582 into pytorch:main on Aug 1, 2025
8 of 10 checks passed
@tushar00jain deleted the pr243 branch on August 1, 2025 06:30